Skip to content

fix(pairs): discriminate OM/IEM forecasts by source, not issued_at (#67)#69

Merged
helloiamvu merged 6 commits into
merged-visionfrom
fix/issue-67-om-source-discriminator
Jun 6, 2026
Merged

fix(pairs): discriminate OM/IEM forecasts by source, not issued_at (#67)#69
helloiamvu merged 6 commits into
merged-visionfrom
fix/issue-67-om-source-discriminator

Conversation

@helloiamvu

Copy link
Copy Markdown
Member

Summary

Fixes #67. build_pairs_row() split forecast records by issued_at presence, but Phase 20+ Open-Meteo rows carry a derived issued_at, so they were misrouted into the IEM MOS aggregation path — silently nulling forecast temps and polluting IEM run-selection when both sources are combined.

Discriminate by the authoritative source field instead (open_meteo* → Open-Meteo, else IEM), matching the contract research._fetch_open_meteo_range already documents ("discriminates via row.get('source')").

Changes

  • _is_open_meteo_record(): source prefixed open_meteo OR legacy no-source/no-issued_at shape (real IEM always carries issued_at, so a record missing both can only be legacy OM — keeps backward-compat).
  • OM aggregation falls back to a pre-converted temperature_f (the shape the research wrapper emits) so source-routed rows don't aggregate to null.
  • OM branch now preserves pop_6hr_pct / qpf_6hr_in and fcst_issued_at provenance that previously came via the IEM branch.
  • Leakage guard: OM rows from a run issued after market close are excluded from temp/POP/QPF aggregation (not just the timestamp) — mirrors _select_best_run's cutoff. The SDK's temporal-safety invariant.

Tests (TDD)

8 new regression tests in test_pairs.py covering: derived-issued_at classification, IEM-still-preferred, research-wrapper temperature_f shape, pop/qpf survival, zero-POP, legacy source-less shape, issued_at provenance, and after-close leakage exclusion. Parity gate (tests/test_parity.py) unchanged.

Review

Codex (gpt-5.5, medium) — 4 iterations: 1 P1 (leakage) + 3 P2 (precip regression, legacy shape, provenance) all fixed; final review clean.

TS Parity

Python-internal forecast-join discriminator fix; no public API surface change (same research() columns). The TS twin's buildPairsRow equivalent should get a parity ticket per CROSS-SDK-SYNC.md to apply the same source-based split — tracked separately.

Phase 20+ Open-Meteo rows carry a derived issued_at, so the old
issued_at-presence split in build_pairs_row() misrouted them into the
IEM MOS aggregation path — silently nulling forecast temps and polluting
IEM run-selection when both sources are combined.

Discriminate by the authoritative source field instead (open_meteo* ->
Open-Meteo, else IEM), matching the contract research._fetch_open_meteo_range
already documents. Also teach _aggregate_fcst_temps_openmeteo to fall back
to a pre-converted temperature_f so source-discriminated rows from the
research() wrapper (which emits temperature_f, not temperature_c) aggregate
correctly instead of returning null.

TDD: 3 new regression tests in test_pairs.py cover the derived-issued_at
classification, IEM-still-preferred, and the research-wrapper temperature_f
shape.
Codex review caught a regression: research._fetch_open_meteo_range emits
OM rows carrying pop_6hr_pct / qpf_6hr_in (not precipitation_probability_pct
and no QPF read). Pre-#67 those rows flowed through the IEM branch, which
set both fields; after source-routing they hit the OM branch, which only
read precipitation_probability_pct and never set QPF — silently nulling
precip columns for research(forecast_source="open_meteo").

OM branch now accepts the pop_6hr_pct alias (explicit None-checks keep a
valid 0.0) and sums qpf_6hr_in over the window, matching IEM semantics.
+2 regression tests.
…odex P2)

Codex flagged that the pure source-prefix split regressed the previously
documented Open-Meteo shape (no source, no issued_at, temperature_c) to the
IEM branch -> null temps. Extract _is_open_meteo_record(): source prefixed
open_meteo OR (no source AND no issued_at). Real IEM rows always carry an
issued_at, so a record missing both can only be legacy OM. +1 regression test.
… codex P2)

Codex flagged leakage-safety provenance loss: Phase 20+ OM rows carry a
derived issued_at; pre-#67 the IEM branch set fcst_issued via _select_best_run,
but after source routing the OM branch left fcst_issued_at None for
forecast_source="open_meteo". Restore it as the most-recent OM issued_at
at-or-before market close — never leaking a run issued after settlement.
+2 regression tests.
…P1 leakage)

Codex P1: my provenance fix filtered issued_at only for the displayed
timestamp; the OM aggregation still summed window rows issued AFTER market
close, leaking their temp/POP/QPF into the training pair (lookahead). Pre-#67
_select_best_run applied this cutoff; the OM branch did not. Now filter
window_om by issued_at <= market_close before aggregating temps/POP/QPF/
issued_at. Legacy source-less rows (no issued_at) are kept — nothing to leak.
Strengthened test asserts the after-close temp is excluded, not just hidden.
@helloiamvu helloiamvu requested a review from Tarabcak June 6, 2026 13:02
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Parity ticket gate: PASSED

parity-ticket-check: PR does not touch parity-trigger surface; gate skipped.

See CROSS-SDK-SYNC.md §2 for the workflow.

@helloiamvu helloiamvu merged commit 237c437 into merged-vision Jun 6, 2026
17 checks passed
helloiamvu added a commit that referenced this pull request Jun 6, 2026
…h() docs (#52)

Bump all three packages 1.5.2 -> 1.5.3.

- #67 (#69): build_pairs_row() now discriminates OM/IEM forecasts by source
  instead of issued_at presence, fixing silent null temps + IEM run-selection
  pollution for Phase 20+ multi-source callers; closes a lookahead-leakage
  path for after-close OM runs. Parity gate unaffected.
- #52 (#70): research() docstring clarifies daily-row return granularity.
@helloiamvu helloiamvu deleted the fix/issue-67-om-source-discriminator branch June 18, 2026 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant